Topic Modelling: A beginner's guide

May 18, 2021

Introduction

Natural Language Processing (NLP) is an interdisciplinary field that focuses on enabling computers to understand and generate human language. It powers applications such as sentiment analysis, machine translation, language generation, and text summarization. One important NLP task is topic modelling: discovering the hidden topics that run through a collection of documents. Topic modelling has been applied in domains as varied as political science, digital humanities, and sociology. This blog post is a beginner's guide to two popular topic modelling techniques, Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), with an objective comparison of their strengths and weaknesses.

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) is one of the most popular topic modelling techniques in use today. It is a generative probabilistic model that treats each document as a mixture of latent topics, with each topic represented by a probability distribution over words. The number of topics is a hyperparameter chosen by the user, typically guided by statistical measures such as held-out perplexity or topic coherence.
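
To make this concrete, here is a minimal sketch of LDA with scikit-learn. The toy corpus, the choice of two topics, and the preprocessing are illustrative assumptions rather than a standard recipe:

    # A minimal LDA sketch using scikit-learn (toy corpus for illustration).
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    docs = [
        "the cat sat on the mat",
        "dogs and cats are popular pets",
        "stock markets fell sharply today",
        "investors worry about market volatility",
    ]

    # LDA operates on raw term counts (a bag-of-words matrix).
    vectorizer = CountVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)

    # n_components is the number of topics, chosen by the user.
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    doc_topics = lda.fit_transform(X)  # rows are per-document topic proportions

    # Show the top words of each topic.
    words = vectorizer.get_feature_names_out()
    for k, topic in enumerate(lda.components_):
        top = topic.argsort()[-5:][::-1]
        print(f"Topic {k}:", ", ".join(words[i] for i in top))

Each row of doc_topics sums to one, giving the inferred topic proportions for that document; this is the probabilistic representation described above.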

Strengths

  • LDA provides a probabilistic framework for modelling topics, making it easier to interpret the results.
  • LDA can handle a large number of documents and words, making it suitable for big data applications.
  • LDA's probabilistic formulation yields quantitative measures such as held-out perplexity and topic coherence for tuning the number of topics, which is particularly useful when the optimal number is unknown (a sketch follows this list).
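
As a sketch of that tuning process, assuming the count matrix X from the LDA example above and an arbitrary grid of candidate topic counts, one can compare held-out perplexity across models and prefer lower values (on a real corpus the grid and the split would of course be larger):

    # Compare held-out perplexity for several candidate topic counts.
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.model_selection import train_test_split

    X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

    for k in (2, 5, 10):
        lda = LatentDirichletAllocation(n_components=k, random_state=0)
        lda.fit(X_train)
        print(k, "perplexity:", lda.perplexity(X_test))  # lower is better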

Weaknesses

  • LDA assumes topics occur independently of one another, whereas in real corpora related topics often co-occur.
  • LDA is sensitive to the choice of its Dirichlet priors, which can affect the quality of the results (a sketch follows this list).
  • LDA does not model the temporal dynamics of topics, which can be important in some applications.
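
As a sketch of that prior sensitivity, assuming the count matrix X from the example above: scikit-learn exposes the two Dirichlet priors directly, and changing them changes the inferred document-topic mixtures:

    # Vary the document-topic prior (alpha) and inspect the effect.
    from sklearn.decomposition import LatentDirichletAllocation

    for alpha in (0.1, 1.0):
        lda = LatentDirichletAllocation(
            n_components=2,
            doc_topic_prior=alpha,     # smaller alpha favours sparser mixtures
            topic_word_prior=0.01,     # per-topic word prior (eta)
            random_state=0,
        )
        doc_topics = lda.fit_transform(X)
        print("alpha =", alpha)
        print(doc_topics.round(2))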

Non-negative Matrix Factorization (NMF)

Non-negative Matrix Factorization (NMF) is another popular topic modelling technique that has been widely employed in NLP. NMF approximates the document-term matrix as the product of two low-rank non-negative matrices: a document-topic matrix and a topic-word matrix. Each document is thus represented as a non-negative linear combination of topics, and each topic as a non-negative vector of word weights. The factorization is found by minimizing the reconstruction error between the original matrix and its low-rank approximation, subject to the non-negativity constraints.
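
Here is a corresponding sketch with scikit-learn's NMF. The TF-IDF weighting and the toy corpus are illustrative choices, not requirements of the method:

    # A minimal NMF sketch using scikit-learn (toy corpus for illustration).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import NMF

    docs = [
        "the cat sat on the mat",
        "dogs and cats are popular pets",
        "stock markets fell sharply today",
        "investors worry about market volatility",
    ]

    vectorizer = TfidfVectorizer(stop_words="english")
    X = vectorizer.fit_transform(docs)  # non-negative document-term matrix

    # Factorize X ~ W @ H, with W (document-topic) and H (topic-word) non-negative.
    nmf = NMF(n_components=2, init="nndsvd", random_state=0)
    W = nmf.fit_transform(X)
    H = nmf.components_

    words = vectorizer.get_feature_names_out()
    for k, topic in enumerate(H):
        top = topic.argsort()[-5:][::-1]
        print(f"Topic {k}:", ", ".join(words[i] for i in top))

Unlike LDA's outputs, the rows of W and H are unnormalized non-negative scores rather than probabilities.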

Strengths

  • NMF provides a non-negative decomposition of documents, making it easier to interpret the results.
  • NMF is computationally cheap compared with fully probabilistic models such as LDA, making it suitable for applications with limited computational resources.
  • Weighted variants of NMF can accommodate missing entries in the document-term matrix, which is useful when the textual data is incomplete.

Weaknesses

  • NMF requires the number of topics to be set manually, which can be challenging in some cases.
  • NMF is not a probabilistic model, so its topic and document weights lack the probabilistic interpretation and uncertainty estimates that LDA provides.
  • NMF is sensitive to the initialization of its factor matrices, which can affect the quality and reproducibility of the results (a sketch follows this list).
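
As a sketch of that initialization sensitivity, assuming the TF-IDF matrix X from the NMF example above: random restarts can land on different reconstruction errors, whereas a deterministic initialization such as "nndsvd" makes runs reproducible:

    # Random restarts may converge to different local optima.
    from sklearn.decomposition import NMF

    for seed in (0, 1, 2):
        nmf = NMF(n_components=2, init="random", random_state=seed, max_iter=500)
        nmf.fit(X)
        print("seed", seed, "reconstruction error:", nmf.reconstruction_err_)

    # A deterministic SVD-based initialization removes the run-to-run variance.
    nmf = NMF(n_components=2, init="nndsvd", max_iter=500)
    nmf.fit(X)
    print("nndsvd reconstruction error:", nmf.reconstruction_err_)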

Conclusion

In conclusion, both LDA and NMF are effective topic modelling techniques that are widely used in NLP. LDA offers a probabilistic framework whose topic and document distributions are straightforward to interpret, while NMF offers a simple, computationally cheap non-negative decomposition of the document-term matrix. Each technique has strengths and weaknesses that should be weighed when selecting the appropriate method for a specific application.
